Elasticsearch: Examples of using token filters in analyzers

2023-05-14 15:51

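Analyzers play a central role in Elasticsearch. The token filters in an analyzer post-process the tokens that the tokenizer emits, so the final tokens determine both how much is stored in the index and how text can be searched. In this article I walk through some commonly used token filters. For more on token filters, see the official Elastic documentation. For more background on analyzers, see my earlier article "Elasticsearch: analyzer".

An analyzer is made up of zero or more char filters, exactly one tokenizer, and zero or more token filters, applied in that order.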
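A rough way to picture this composition is the plain-Python sketch below. It is not how Elasticsearch or Lucene implement analysis; the names `analyze`, `whitespace_tokenizer`, and `lowercase` are illustrative assumptions.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    # 1. Zero or more char filters rewrite the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the text into tokens.
    tokens = tokenizer(text)
    # 3. Zero or more token filters run in the order they are listed.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


print(analyze("THE Quick FoX", [], whitespace_tokenizer, [lowercase]))
# ['the', 'quick', 'fox']
```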

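Installation

The examples in this article need IK, the most popular Chinese analyzer plugin. For detailed installation steps, see the article "Elasticsearch: IK Chinese analyzer".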

Apostrophe token filter

Strips all characters after an apostrophe, including the apostrophe itself. This pattern is common in English, and the filter is also part of Elasticsearch's built-in Turkish analyzer, where an apostrophe separates a proper name from its suffixes. For example, take the following sentence:

```
This is Tom's clothes.
```

The sentence contains an apostrophe ('). During analysis we want to drop the apostrophe and everything after it. The following request shows the filter at work:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["apostrophe"],
  "text": "Istanbul'a veya Istanbul'dan"
}
```

The filter produces the following result:

```json
{
  "tokens": [
    { "token": "Istanbul", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 },
    { "token": "veya", "start_offset": 11, "end_offset": 15, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Istanbul", "start_offset": 16, "end_offset": 28, "type": "<ALPHANUM>", "position": 2 }
  ]
}
```
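The effect of the apostrophe filter can be imitated in a few lines of Python. This is only a sketch of the behavior, not the Lucene code; `apostrophe_filter` is a hypothetical helper name.

```python
def apostrophe_filter(tokens):
    # Keep only the part of each token before the first apostrophe.
    # Both the ASCII apostrophe (') and the typographic one (U+2019) are treated alike.
    return [t.split("'", 1)[0].split("\u2019", 1)[0] for t in tokens]


print(apostrophe_filter(["Istanbul'a", "veya", "Istanbul'dan"]))
# ['Istanbul', 'veya', 'Istanbul']
```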

In practice, we can define an index that uses the filter:

```
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "apostrophe",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

Here we define an index named test whose text field uses my_analyzer. Because no search_analyzer is specified, the same analyzer, my_analyzer, is used at both index time and search time. We index a document with the following command:

```
PUT test/_doc/1
{
  "text": "Istanbul'a veya Istanbul'dan"
}
```

We can then search for the document with the following query:

```
GET test/_search
{
  "query": {
    "match": {
      "text": "Istanbul"
    }
  }
}
```

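The search returns the document we just indexed: at index time the analyzer reduces both Istanbul'a and Istanbul'dan to the token istanbul, and at search time the query term Istanbul is lowercased to the same token.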

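ASCII folding token filter

Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the ASCII range, U+0000 to U+007F) to their ASCII equivalents, if one exists. For example, the filter changes à to a.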
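The folding idea can be approximated in Python with Unicode normalization. This is a sketch under a simplifying assumption: it only handles characters that decompose into a base letter plus combining marks, whereas the real filter uses an explicit mapping table that also covers characters such as ø and ß. The name `ascii_fold` is made up for the example.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    # Decompose accented letters into base letter + combining mark (NFKD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print([ascii_fold(t) for t in "açaí à la carte".split()])
# ['acai', 'a', 'la', 'carte']
```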

The following analyze API request uses the asciifolding filter to drop the diacritical marks in açaí à la carte:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "acai", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "a", "start_offset": 5, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "la", "start_offset": 7, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 },
    { "token": "carte", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 3 }
  ]
}
```

Classic token filter

Performs optional post-processing of terms generated by the classic tokenizer.

This filter removes the English possessive ('s) from the end of words and removes dots from acronyms. It uses Lucene's ClassicFilter. For example:

```
GET /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog's bone."
}
```

The request returns:

```json
{
  "tokens": [
    { "token": "The", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "2", "start_offset": 4, "end_offset": 5, "type": "<ALPHANUM>", "position": 1 },
    { "token": "QUICK", "start_offset": 6, "end_offset": 16, "type": "<ACRONYM>", "position": 2 },
    { "token": "Brown", "start_offset": 17, "end_offset": 22, "type": "<ALPHANUM>", "position": 3 },
    { "token": "Foxes", "start_offset": 23, "end_offset": 28, "type": "<ALPHANUM>", "position": 4 },
    { "token": "jumped", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 5 },
    { "token": "over", "start_offset": 36, "end_offset": 40, "type": "<ALPHANUM>", "position": 6 },
    { "token": "the", "start_offset": 41, "end_offset": 44, "type": "<ALPHANUM>", "position": 7 },
    { "token": "lazy", "start_offset": 45, "end_offset": 49, "type": "<ALPHANUM>", "position": 8 },
    { "token": "dog", "start_offset": 50, "end_offset": 55, "type": "<APOSTROPHE>", "position": 9 },
    { "token": "bone", "start_offset": 56, "end_offset": 60, "type": "<ALPHANUM>", "position": 10 }
  ]
}
```

As expected, the dots in Q.U.I.C.K. are removed, and the 's in dog's is stripped.

Conditional token filter

Applies a set of token filters to tokens that match the condition in a provided predicate script. This filter uses Lucene's ConditionalTokenFilter.

The following analyze API request uses the condition filter to match tokens with fewer than 5 characters in THE QUICK BROWN FOX. It then applies the lowercase filter to those matching tokens, converting them to lowercase.

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "QUICK", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "BROWN", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 },
    { "token": "fox", "start_offset": 16, "end_offset": 19, "type": "<ALPHANUM>", "position": 3 }
  ]
}
```

Note that the script here is a Painless script. Although the lowercase filter is configured, it is applied only to tokens shorter than 5 characters. In the output, BROWN and QUICK are not lowercased because each is exactly 5 characters long and so does not satisfy the condition.
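The same "apply a filter only when a predicate holds" logic can be sketched in Python. The helper `condition_filter` and its parameters are assumptions made for illustration; the real filter evaluates a Painless script against each token.

```python
def condition_filter(tokens, predicate, inner_filter):
    # Apply inner_filter only to tokens for which predicate is true;
    # all other tokens pass through unchanged.
    return [inner_filter(t) if predicate(t) else t for t in tokens]


tokens = "THE QUICK BROWN FOX".split()
print(condition_filter(tokens, lambda t: len(t) < 5, str.lower))
# ['the', 'QUICK', 'BROWN', 'fox']
```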

Customize and add to an analyzer

To customize the condition filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom condition filter to configure a new custom analyzer. The custom condition filter matches the first token in the stream and then reverses that token with the reverse filter.

```
PUT /palindrome_list
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_reverse_first_token": {
          "tokenizer": "whitespace",
          "filter": [ "reverse_first_token" ]
        }
      },
      "filter": {
        "reverse_first_token": {
          "type": "condition",
          "filter": [ "reverse" ],
          "script": {
            "source": "token.getPosition() === 0"
          }
        }
      }
    }
  }
}
```

Fingerprint token filter

Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token. For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:

1. Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ].
2. Removes the duplicate instance of the very token.
3. Concatenates the token stream into a single output token: [ fox quick the very was ].

This filter uses Lucene's FingerprintFilter.

The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:

```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["fingerprint"],
  "text": "zebra jumps over resting resting dog"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "dog jumps over resting zebra", "start_offset": 0, "end_offset": 36, "type": "fingerprint", "position": 0 }
  ]
}
```

Note that it returns only a single token.
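The sort, deduplicate, and join steps of the fingerprint filter map directly onto a short Python sketch. The function name `fingerprint` and its `separator` argument are chosen for this example and are not taken from the Lucene source.

```python
def fingerprint(tokens, separator=" "):
    # Deduplicate with a set, sort alphabetically, then join into one token.
    return separator.join(sorted(set(tokens)))


print(fingerprint("zebra jumps over resting resting dog".split()))
# dog jumps over resting zebra
```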

Reverse token filter

Reverses each token in a stream. For example, you can use the reverse filter to change cat to tac. Reversed tokens are useful for suffix-based searches, such as finding words that end in -ion or searching file names by their extension. See my earlier article "Elasticsearch: using regexp search correctly". This filter uses Lucene's ReverseStringFilter.

The following analyze API request uses the reverse filter to reverse each token in quick fox jumps:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": "quick fox jumps"
}
```

The filter returns:

```json
{
  "tokens": [
    { "token": "kciuq", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 },
    { "token": "xof", "start_offset": 6, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "spmuj", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 }
  ]
}
```

As in the first example, we can define an index that uses this filter:

```
PUT reverse_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_reverse": {
          "tokenizer": "whitespace",
          "filter": ["reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace_reverse"
      }
    }
  }
}
```

We index a document with the following command:

```
PUT reverse_example/_doc/1
{
  "text": "I like the speed of this network"
}
```

We can then search as follows:

```
GET reverse_example/_search
{
  "query": {
    "regexp": {
      "text": "krow.*"
    }
  }
}
```

One point deserves emphasis: a search with a leading wildcard is slow. In that case we can use the reverse filter to reverse the tokens, so that a suffix match such as .*work becomes a prefix match, krow.*, against the reversed token krowten.
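Why reversing helps can be seen in a small Python sketch: a suffix test on the original tokens becomes a prefix test on the reversed ones. The helper `reverse_tokens` is a hypothetical name, and the list comprehension stands in for the index lookup that Elasticsearch would perform.

```python
def reverse_tokens(tokens):
    # Reverse the characters of every token.
    return [t[::-1] for t in tokens]


indexed = reverse_tokens("I like the speed of this network".split())

# Searching for the suffix "work" is a prefix search for "krow"
# against the reversed tokens.
suffix = "work"
matches = [t[::-1] for t in indexed if t.startswith(suffix[::-1])]
print(matches)
# ['network']
```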

Unique token filter

Removes duplicate tokens from a stream. For example, you can use the unique filter to change the lazy lazy dog to the lazy dog. If the only_on_same_position parameter is set to true, the unique filter removes only duplicate tokens in the same position.

Note: When only_on_same_position is true, the unique filter works the same as the remove_duplicates filter.

The following example shows the filter in action:

```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["unique"],
  "text": "the quick fox jumps the lazy fox"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1 },
    { "token": "fox", "start_offset": 10, "end_offset": 13, "type": "word", "position": 2 },
    { "token": "jumps", "start_offset": 14, "end_offset": 19, "type": "word", "position": 3 },
    { "token": "lazy", "start_offset": 24, "end_offset": 28, "type": "word", "position": 4 }
  ]
}
```

The output contains the and fox only once each, although both appear twice in the original text. We can add the filter to an analyzer as follows:

```
PUT custom_unique_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_truncate": {
          "tokenizer": "standard",
          "filter": ["unique"]
        }
      }
    }
  }
}
```

We can also customize the filter:

```
PUT letter_unique_pos_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "letter_unique_pos": {
          "tokenizer": "letter",
          "filter": [ "unique_pos" ]
        }
      },
      "filter": {
        "unique_pos": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  }
}
```

Here we set only_on_same_position to true. Its default value is false.
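The difference between the two modes can be illustrated with a Python sketch that tracks each token's position. It is a simplified model: the function `unique` and the `(term, position)` pairs are assumptions of this example, and the sketch does not renumber positions the way the real filter output does.

```python
def unique(tokens, only_on_same_position=False):
    """tokens is a list of (term, position) pairs."""
    seen = set()
    kept = []
    for term, position in tokens:
        # With only_on_same_position, a token is a duplicate only if the
        # same term was already seen at the same position.
        key = (term, position) if only_on_same_position else term
        if key not in seen:
            seen.add(key)
            kept.append((term, position))
    return kept


stream = list(zip("the quick fox jumps the lazy fox".split(), range(7)))
print([term for term, _ in unique(stream)])
# ['the', 'quick', 'fox', 'jumps', 'lazy']
```

With `only_on_same_position=True` nothing is removed from this stream, because every token sits at a different position.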

Length token filter

Removes tokens shorter or longer than specified character lengths. For example, you can use the length filter to exclude tokens shorter than 2 characters and tokens longer than 5 characters. This filter uses Lucene's LengthFilter.

Tip: The length filter removes entire tokens. If you'd prefer to shorten tokens to a specific length, use the truncate filter.

For example:

```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "length",
      "min": 0,
      "max": 4
    }
  ],
  "text": "the quick brown fox jumps over the lazy dog"
}
```

The command returns the tokens that are 0 to 4 characters long:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "fox", "start_offset": 16, "end_offset": 19, "type": "word", "position": 3 },
    { "token": "over", "start_offset": 26, "end_offset": 30, "type": "word", "position": 5 },
    { "token": "the", "start_offset": 31, "end_offset": 34, "type": "word", "position": 6 },
    { "token": "lazy", "start_offset": 35, "end_offset": 39, "type": "word", "position": 7 },
    { "token": "dog", "start_offset": 40, "end_offset": 43, "type": "word", "position": 8 }
  ]
}
```

This filter is also very useful for Chinese text. For example, suppose we want to keep only Chinese tokens that are at least two characters long:

```
PUT twitter
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "longer_than_2"
          ]
        }
      },
      "filter": {
        "longer_than_2": {
          "type": "length",
          "min": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

We test the analyzer of this index:

```
GET twitter/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["我爱北京天安门"]
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "北京", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "天安门", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
```

It returns only 北京 and 天安门, the two tokens that are at least 2 characters long. Compare this with the output of the plain ik_smart analyzer:

```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["我爱北京天安门"]
}
```

```json
{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "爱", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "北京", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "天安门", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
```

Now suppose we index the following document:

```
PUT twitter/_doc/1
{
  "text": "我爱北京天安门"
}
```

and search with the following query:

```
GET twitter/_search
{
  "query": {
    "match": {
      "text": "我"
    }
  }
}
```

It returns no documents. However, if we search with the following query:

```
GET twitter/_search
{
  "query": {
    "match": {
      "text": "北京"
    }
  }
}
```

it returns our document.

Lowercase token filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog.

In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.

For example:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 10, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jumps", "start_offset": 14, "end_offset": 19, "type": "<ALPHANUM>", "position": 3 }
  ]
}
```

As expected, every uppercase letter is converted to lowercase.

Uppercase token filter

Changes token text to uppercase. For example, you can use the uppercase filter to change the Lazy DoG to THE LAZY DOG. This filter uses Lucene's UpperCaseFilter.

The following analyze API request uses the default uppercase filter to change the Quick FoX JUMPs to uppercase:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": "the Quick FoX JUMPs"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "THE", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "QUICK", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "FOX", "start_offset": 10, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "JUMPS", "start_offset": 14, "end_offset": 19, "type": "<ALPHANUM>", "position": 3 }
  ]
}
```

The following create index API request uses the uppercase filter to configure a new custom analyzer.

```
PUT uppercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_uppercase": {
          "tokenizer": "whitespace",
          "filter": [ "uppercase" ]
        }
      }
    }
  }
}
```

Stop token filter

Removes stop words from a token stream. When not customized, the filter removes the following English stop words by default:

```
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
```

In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or file. The stop filter uses Lucene's StopFilter.

Example:

The following analyze API request uses the stop filter to remove the stop words a and the from a quick fox jumps over the lazy dog:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "quick", "start_offset": 2, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 8, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jumps", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "over", "start_offset": 18, "end_offset": 22, "type": "<ALPHANUM>", "position": 4 },
    { "token": "lazy", "start_offset": 27, "end_offset": 31, "type": "<ALPHANUM>", "position": 6 },
    { "token": "dog", "start_offset": 32, "end_offset": 35, "type": "<ALPHANUM>", "position": 7 }
  ]
}
```

In the output, a and the no longer appear.

To customize the stop filter, we can follow this pattern:

```
PUT stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stop"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [
            "over",
            "dog"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

Here we define a custom stop filter named my_stop whose stop word list contains over and dog, so these two words are removed from the token stream. We test it with the following command:

```
GET stop_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["a quick fox jumps over the lazy dog"]
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "a", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 },
    { "token": "quick", "start_offset": 2, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 8, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jumps", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "the", "start_offset": 23, "end_offset": 26, "type": "<ALPHANUM>", "position": 5 },
    { "token": "lazy", "start_offset": 27, "end_offset": 31, "type": "<ALPHANUM>", "position": 6 }
  ]
}
```

In the result, dog and over are gone. However, a and the are back, because the custom stopwords list replaces the default English list rather than extending it. If we also want to keep the behavior of the default stop filter, we can redefine the stop_example index:

```
DELETE stop_example

PUT stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stop",
            "stop"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [
            "over",
            "dog"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

An analyzer can have multiple token filters, and they run in the order in which they are listed. Here my_stop runs first, followed by the stop filter. We test with the same command:

```
GET stop_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["a quick fox jumps over the lazy dog"]
}
```

This time the result is:

```json
{
  "tokens": [
    { "token": "quick", "start_offset": 2, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 8, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jumps", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "lazy", "start_offset": 27, "end_offset": 31, "type": "<ALPHANUM>", "position": 6 }
  ]
}
```

Now a and the are gone as well.
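The chaining of two stop filters can be modeled in Python as two successive passes over the token list. This is a sketch, assuming exact-match removal on already tokenized text; `stop_filter` and `ENGLISH_STOP_WORDS` are names invented for the example, with the word list copied from the default English list quoted earlier.

```python
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_filter(tokens, stopwords):
    # Drop every token that appears in the stop word set.
    return [t for t in tokens if t not in stopwords]


tokens = "a quick fox jumps over the lazy dog".split()

# First pass: the custom my_stop list only.
after_my_stop = stop_filter(tokens, {"over", "dog"})
print(after_my_stop)
# ['a', 'quick', 'fox', 'jumps', 'the', 'lazy']

# Second pass: the default English list, applied to the first pass's output.
after_both = stop_filter(after_my_stop, ENGLISH_STOP_WORDS)
print(after_both)
# ['quick', 'fox', 'jumps', 'lazy']
```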

Predicate script token filter

Removes tokens that don't match a provided predicate script. The filter supports inline Painless scripts only. Scripts are evaluated in the analysis predicate context.

Example:

The following analyze API request uses the predicate_token_filter filter to output only the tokens longer than three characters from the fox jumps the lazy dog.

```
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.term.length() > 3
        """
      }
    }
  ],
  "text": "the fox jumps the lazy dog"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "jumps", "start_offset": 8, "end_offset": 13, "type": "word", "position": 2 },
    { "token": "lazy", "start_offset": 18, "end_offset": 22, "type": "word", "position": 4 }
  ]
}
```

In fact, we can use this filter to implement the length filter described above. For example:

```
PUT predicate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "my_predicate"
          ]
        }
      },
      "filter": {
        "my_predicate": {
          "type": "predicate_token_filter",
          "script": {
            "source": """
              token.term.length() > 1
            """
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

We test it with the following command:

```
GET predicate_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["我爱北京天安门"]
}
```

```json
{
  "tokens": [
    { "token": "北京", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "天安门", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
```

Only tokens longer than 1 character are returned. In practice, scripts run more slowly than built-in filters, so use this filter with care if your script is long or expensive to evaluate.
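The equivalence between a predicate on token length and the length filter can be checked with a small Python sketch. Both helpers are illustrative stand-ins, and the token list is the ik_smart output shown earlier in this article, hard-coded rather than computed.

```python
def predicate_token_filter(tokens, predicate):
    # Keep only the tokens for which the predicate returns True.
    return [t for t in tokens if predicate(t)]


def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    # Keep only tokens whose length lies in [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]


ik_smart_tokens = ["我", "爱", "北京", "天安门"]

by_predicate = predicate_token_filter(ik_smart_tokens, lambda t: len(t) > 1)
by_length = length_filter(ik_smart_tokens, min_len=2)
print(by_predicate, by_predicate == by_length)
# ['北京', '天安门'] True
```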

Customize and add to an analyzer

To customize the predicate_token_filter filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom predicate_token_filter filter, my_script_filter.

The my_script_filter filter removes tokens of any type other than ALPHANUM.

```
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_script_filter"
          ]
        }
      },
      "filter": {
        "my_script_filter": {
          "type": "predicate_token_filter",
          "script": {
            "source": """
              token.type.contains("ALPHANUM")
            """
          }
        }
      }
    }
  }
}
```

Trim token filter

Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token's offsets. The trim filter uses Lucene's TrimFilter.

Example:

To see how the trim filter works, you first need to produce a token containing whitespace. The following analyze API request uses the keyword tokenizer to produce a token for " fox ".

```
GET _analyze
{
  "tokenizer": "keyword",
  "text": " fox "
}
```

The API returns the following response. Note that the " fox " token keeps the whitespace of the original text.

```json
{
  "tokens": [
    { "token": " fox ", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }
  ]
}
```

To remove the whitespace, add the trim filter to the previous analyze API request.

```
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["trim"],
  "text": " fox "
}
```

The API returns the following response. The returned fox token no longer has leading or trailing whitespace. Although the token is now shorter, start_offset and end_offset are unchanged.

```json
{
  "tokens": [
    { "token": "fox", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }
  ]
}
```
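The point that trimming changes the term but not the offsets can be modeled with a small Python sketch. The `Token` tuple and the `trim` function are assumptions of this example, not Elasticsearch types.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start_offset: int
    end_offset: int


def trim(token: Token) -> Token:
    # Strip whitespace from the term only; the offsets still point at the
    # original, untrimmed span of the source text.
    return token._replace(term=token.term.strip())


print(trim(Token(" fox ", 0, 5)))
# Token(term='fox', start_offset=0, end_offset=5)
```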

The following create index API request uses the trim filter to configure a new custom analyzer.

```
PUT trim_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_trim": {
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        }
      }
    }
  }
}
```

Synonym token filter

This is the synonym filter. See my earlier article "Elasticsearch: using synonyms to improve search efficiency".

Truncate token filter

Truncates tokens that exceed a specified character limit. This limit defaults to 10 but can be customized using the length parameter. For example, you can use the truncate filter to shorten all tokens to 3 characters or fewer, changing jumping fox to jum fox. This filter uses Lucene's TruncateTokenFilter.

Example:

The following analyze API request uses the truncate filter to shorten tokens that exceed 10 characters in the quinquennial extravaganza carried on:

```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["truncate"],
  "text": "the quinquennial extravaganza carried on"
}
```

The command returns:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "quinquenni", "start_offset": 4, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "extravagan", "start_offset": 17, "end_offset": 29, "type": "word", "position": 2 },
    { "token": "carried", "start_offset": 30, "end_offset": 37, "type": "word", "position": 3 },
    { "token": "on", "start_offset": 38, "end_offset": 40, "type": "word", "position": 4 }
  ]
}
```

No token is longer than 10 characters, which is the default limit.

Add to an analyzer

The following create index API request uses the truncate filter to configure a new custom analyzer.

```
PUT custom_truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_truncate": {
          "tokenizer": "standard",
          "filter": ["truncate"]
        }
      }
    }
  }
}
```

We can also customize the truncate filter:

```
PUT 5_char_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_5_char": {
          "tokenizer": "lowercase",
          "filter": [ "5_char_trunc" ]
        }
      },
      "filter": {
        "5_char_trunc": {
          "type": "truncate",
          "length": 5
        }
      }
    }
  }
}
```

By setting the length parameter, we cap tokens at 5 characters instead of the default 10.
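Truncation is easy to mirror in Python with slicing. The sketch below uses a hypothetical `truncate` helper whose `length` argument plays the role of the filter's length parameter.

```python
def truncate(tokens, length=10):
    # Cut every token down to at most `length` characters.
    return [t[:length] for t in tokens]


print(truncate("the quinquennial extravaganza carried on".split()))
# ['the', 'quinquenni', 'extravagan', 'carried', 'on']

print(truncate(["jumping", "fox"], length=3))
# ['jum', 'fox']
```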

Limit token count token filter

Limits the number of output tokens. The limit filter is commonly used to restrict the size of document field values based on token count. By default, the limit filter keeps only the first token in a stream. For example, the filter can change the token stream [ one, two, three ] to [ one ]. This filter uses Lucene's LimitTokenCountFilter.

Example:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "limit",
      "max_token_count": 2
    }
  ],
  "text": "quick fox jumps over lazy dog"
}
```

The output is:

```json
{
  "tokens": [
    { "token": "quick", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 },
    { "token": "fox", "start_offset": 6, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }
  ]
}
```

That is, only the first two tokens are kept.

Add to an analyzer

The following create index API request uses the limit filter to configure a new custom analyzer.

```
PUT limit_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_one_token_limit": {
          "tokenizer": "standard",
          "filter": [ "limit" ]
        }
      }
    }
  }
}
```

We can also customize the limit filter:

```
PUT custom_limit_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_five_token_limit": {
          "tokenizer": "whitespace",
          "filter": [ "five_token_limit" ]
        }
      },
      "filter": {
        "five_token_limit": {
          "type": "limit",
          "max_token_count": 5
        }
      }
    }
  }
}
```

max_token_count defaults to 1. Customizing it lets us cap the number of tokens that are output.
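In Python terms the limit filter behaves like a slice over the token list. The helper name `limit` is an assumption of this sketch; its `max_token_count` argument mirrors the filter parameter of the same name.

```python
def limit(tokens, max_token_count=1):
    # Keep only the first max_token_count tokens of the stream.
    return tokens[:max_token_count]


print(limit("quick fox jumps over lazy dog".split(), max_token_count=2))
# ['quick', 'fox']

print(limit(["one", "two", "three"]))
# ['one']
```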

Shingle token filter

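For details, see the earlier article "Elasticsearch: Ngrams, edge ngrams, and shingles".

Chinese-related filters

For details, see my earlier articles:

Elasticsearch: IK Chinese analyzer

Elasticsearch: Pinyin analyzer

Elasticsearch: Simplified/Traditional Chinese conversion analyzer - STConvert analysis

That's all for today. For more filters, see the official documentation.

Reference:

[1] Token filter reference | Elasticsearch Guide [8.4] | Elastic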
